Now that we have developed several ways to create categorical palettes, we can appreciate how many times different clusterings on different colorspaces create similar colors. If the colors are too similar, the palette should be modified (e.g. rotating, or choosing a different colorspace), but if they are similar but still distinguishable, they can still be used. However, in a 2D map such as a tSNE plot, similar clusters of dots may be assigned similar colors, making their interpretation more difficult. The issue is a “map colouring” issue, with the added constraint of using every colour once. I have attempted at giving a solution combining two closely related fields: graph theory and computational geometry.
I will illustrate the procedure step by step.
This was done in the palette tools page, and is quite simple. The number of clusters is for convenience the number of colors in the palette. We will use the CLARA RGB palette for this, which has 8 colors. We label each centroid on the plot with the corresponding cluster number.
getCenter <- function(o)
{
xy = 1:2
xy[1] = mean(o[,1])
xy[2] = mean(o[,2])
return(xy)
}
tmp = dist(tsne)
hc = hclust(tmp)
cutt = cutree(hc, k = 8)
dat = as.data.frame(cbind(tsne,cutt)) #Add the clustering assignment to the tSNE coordinates
plot(dat[,1], dat[,2], col = "gray", pch = 16, xlab = NA, ylab = NA)
centers = matrix(0, nrow = 8, ncol = 2)
for(i in unique(cutt)) centers[i,] = getCenter(tsne[cutt == i,])
points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:8, font = 2)
claram = cluster::clara(mm[,1:3], k = 8) #The syntax of clara is the same as k-means
claram = claram$medoids #Take medoids from the clara object
clcols = vector()
for(i in 1:8)
{
clcols[i] <- mm[which(mm[,3] == claram[i,1] & mm[,1] == claram[i,2] & mm[,2] == claram[i,3]),4]
}
set.seed(10)
clcols = clcols[sample(1:length(clcols), length(clcols))]
The Delauney triangulation of a set of points connects points in such a way that no point in this set is inside the circumcircle of any triangle of the triangulation. Another interesting property is that it contains the Euclidean minimum spanning tree of the set of points. Thus the Delauney triangulation of a set of points won’t connect all of them to each other, but will create enough edges that a shortest path to visit all of them can be found, and points close in space will be connected more likely than those farther away. In practical terms it is a way of reducing the set of distances to be calculated, so that the computation of the shortest path can be quick and find easily the most connected centerpoints. We will compute the triangulation using the deldir package.
library(deldir)
vtess <- deldir::deldir(centers[,1], centers[,2], rw = c(range(dat[,1]), range(dat[,2])))
plot(dat[,1], dat[,2], col = "gray", pch = 16, xlab = NA, ylab = NA)
plot(vtess, wlines="triang", wpoints="dummy", number=FALSE, add=TRUE, lty=1, col = "black")
points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:8, font = 2)
We then calculate pairwise distances among all points, but only retain those in the nodes connected by the Delauney triangulation. Then we create a matrix that is suitable for shortest path calculation.
dat.tmp = centers
dist.tmp = dist(dat.tmp)
tri <- vtess$delsgs
tri$dist = tri$ind1
for(i in 1:nrow(tri))
{
tri$dist[i] = as.matrix(dist.tmp)[tri$ind1[i], tri$ind2[i]]
}
arcs = tri[,5:7]
colnames(arcs) = c("source", "target", "weight")
arcs = as.matrix(arcs)
Now we apply the Bellman-Ford shortest path algorithm on the Delauney graph, identifying the shortest path to all the other nodes starting from each node. The weights are the Euclidean distances between the edges. We use the optrees package to perform this search.
BF_dists = list()
dists = vector()
for(k in 1:max(unique(cutt)))
{
BF_dists[[k]] = optrees::getShortestPathTree(1:max(unique(cutt)), arcs, source.node = k, algorithm = "Bellman-Ford", directed = F, show.data = F, show.graph = F, show.distances = F)
dists[k] = sum(BF_dists[[k]]$distances)
}
The dists vector contains the sum of distances walked in the calculation of the shortest path for each node, so we want to take the node with the smallest distance walked:
initial = which(dists == min(dists))
We will begin from centroid number 6, using a random color from our palette:
set.seed(82)
colordf = data.frame(1:8, rep("gray", 8), stringsAsFactors = F)
colnames(colordf) = c("cluster", "color")
colordf[initial, 2] = sample(clcols, 1)
plot(dat[,1], dat[,2], col = colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)
We then need to go by closest to most distant centroid:
nextcols = as.numeric(names(as.matrix(dist.tmp)[initial,][order(as.matrix(dist.tmp)[initial,])][2:ncol(as.matrix(dist.tmp))]))
nextcols
## [1] 3 4 2 1 7 8 5
And to these centroids we assign colors in order of \(\Delta\)E, the “perceived difference” in colors. Two colors with a \(\Delta\)E below 2.3 are impossible to distinguish by the human eye.
dist_to_initial = vector()
for(i in 1:length(clcols)) dist_to_initial[i] = colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[initial, 2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(clcols[i])@coords, from = "sRGB", to = "Lab")
)
names(dist_to_initial) = clcols
dist_to_initial = dist_to_initial[order(dist_to_initial, decreasing = T)]
dist_to_initial = dist_to_initial[1:(length(dist_to_initial)-1)]
colordf[nextcols,2] = names(dist_to_initial)
We can now colour our tSNE plot with a qualitative palette in order to highlight clusters that are close on the plot:
plot(dat[,1], dat[,2], col = colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)
Although distances from the starting centroid are obviously calculated on two dimensions, when we sort them we are actually flattening them in one dimension. This means that even though two centroids (e.g. centroids 2 and 7) are not immediately one after the other in terms of distance from centroid 6, they are close together in space. We want to see how distinctive this color assignment is in terms of color distance. Since we already computed the Delauney triangulation for this set of points, we can join the centers of the circumcircles in which each Delauney triangle is inscribed, thus obtaining the Voronoi diagram for the centroids. We can see that there is a good agreement between the Voronoi cells and the actual clusters:
vtiles = tile.list(vtess)
plot(vtiles, fill = colordf$color)
points(dat[,1], dat[,2], col = "black", bg = colordf[dat[,3],2], pch = 21, xlab = NA, ylab = NA, cex = 0.7)
points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:8, font = 2)
If two Voronoi tessels share an edge, they are obviously adjacent in space, and as we saw earlier it is a good approximation for clusters being adjacent to each other. To find shared edges we will use the spatstat package, and a function found on this answer on Stack Overflow.
library(spatstat)
cps <- ppp(centers[,1], centers[,2],window=owin(range(dat[,1]), range(dat[,2])))
sharededge <- function(X) {
verifyclass(X, "ppp")
Y <- X[as.rectangle(X)]
dX <- deldir(Y)
DS <- dX$dirsgs
xyxy <- DS[,1:4]
names(xyxy) <- c("x0","y0","x1","y1")
sX <- as.psp(xyxy,window=dX$rw)
marks(sX) <- 1:nobjects(sX)
sX <- sX[as.owin(X)]
tX <- tapply(lengths.psp(sX), marks(sX), sum)
jj <- as.integer(names(tX))
ans <- data.frame(ind1=DS[jj,5],
ind2=DS[jj,6],
leng=as.numeric(tX))
return(ans)
}
shared_edge_lengths <- sharededge(cps)
We now check which colors are within each shared edge, and their \(\Delta\)E distances:
edgecolors = shared_edge_lengths
edgecolors$col1 = colordf[edgecolors$ind1,2]
edgecolors$col2 = colordf[edgecolors$ind2,2]
edgecolors$DeltaE.dist = apply(edgecolors, 1,
function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(x[4])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(x[5])@coords, from = "sRGB", to = "Lab")))
By plotting all \(\Delta\)E distances we see which two adjacent tessels have the lowest \(\Delta\)E. For this palette and with this combination the distance values are fairly good (always above 50), so that any assignment is quite distinguishable.
plot(1:nrow(edgecolors), rep(0, nrow(edgecolors)), cex = 0, ylim = c(-1, max(edgecolors$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")
segments(x0 = 1:nrow(edgecolors), y0 = rep(0, nrow(edgecolors)), x1 = 1:nrow(edgecolors), y1 =edgecolors$DeltaE.dist)
points(1:nrow(edgecolors), rep(0, nrow(edgecolors)), cex = 4, pch = 15, col = edgecolors$col1)
points(1:nrow(edgecolors), edgecolors$DeltaE.dist, cex = 4, pch = 15, col = edgecolors$col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
axis(1, at = 1:nrow(edgecolors), labels = paste(edgecolors$ind1, edgecolors$ind2, sep = "-"), las = 2)
We can also check all possible distances to have a sense of how much better this procedure was:
distpairs = expand.grid(1:8, 1:8)
distpairs$Col1 = colordf[distpairs$Var1,2]
distpairs$Col2 = colordf[distpairs$Var2,2]
distpairs$DeltaE.dist = apply(distpairs[,1:2], 1,
function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[x[1],2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(colordf[x[2],2])@coords, from = "sRGB", to = "Lab")))
plot(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 0, ylim = c(-1, max(distpairs$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")
segments(x0 = 1:nrow(distpairs), y0 = rep(0, nrow(distpairs)), x1 = 1:nrow(distpairs), y1 =distpairs$DeltaE.dist)
points(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 2, pch = 15, col = distpairs$Col1)
points(1:nrow(distpairs), distpairs$DeltaE.dist, cex = 2, pch = 15, col = distpairs$Col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
axis(1, at = 1:nrow(distpairs), labels = paste(distpairs$Var1, distpairs$Var2, sep = "-"), las = 2)
It is important to note that had we started with a different color in the most connected centroid, we would have generated a different order of colours, maybe maximizing the distances between shared edges. So the next logical step in this procedure is to generate an assignment for every color in the starting point, look at the distribution of color distances, and choose the assignment with the highest number of large distances:
colorassignments = list()
listcolordf = list()
for(j in 1:length(clcols))
{
colordf = data.frame(1:length(clcols), rep("gray", length(clcols)), stringsAsFactors = F)
colordf[initial, 2] = clcols[j]
dist_to_initial = vector()
for(i in 1:length(clcols))
{
dist_to_initial[i] = colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[initial, 2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(clcols[i])@coords, from = "sRGB", to = "Lab"))
}
names(dist_to_initial) = clcols
dist_to_initial = dist_to_initial[order(dist_to_initial, decreasing = T)]
dist_to_initial = dist_to_initial[1:(length(dist_to_initial)-1)]
colordf[nextcols,2] = names(dist_to_initial)
edgecolors = shared_edge_lengths
edgecolors$col1 = colordf[edgecolors$ind1,2]
edgecolors$col2 = colordf[edgecolors$ind2,2]
edgecolors$DeltaE.dist = apply(edgecolors, 1,
function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(x[4])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(x[5])@coords, from = "sRGB", to = "Lab")))
colorassignments[[j]] = edgecolors
listcolordf[[j]] = colordf
}
mindists = unlist(lapply(colorassignments, function (x) min(x$DeltaE.dist[x$DeltaE.dist > 0])))
best.assignment = which(mindists == max(mindists))
if(length(best.assignment) > 1) best.assignment = best.assignment[1]
best.colordf = listcolordf[[best.assignment]]
plot(dat[,1], dat[,2], col = best.colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)
points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:length(clcols), font = 2)
And their distances:
plot(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 0, ylim = c(-1, max(colorassignments[[best.assignment]]$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")
segments(x0 = 1:nrow(colorassignments[[best.assignment]]), y0 = rep(0, nrow(colorassignments[[best.assignment]])), x1 = 1:nrow(colorassignments[[best.assignment]]), y1 =colorassignments[[best.assignment]]$DeltaE.dist)
points(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col1)
points(1:nrow(colorassignments[[best.assignment]]), colorassignments[[best.assignment]]$DeltaE.dist, cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
axis(1, at = 1:nrow(colorassignments[[best.assignment]]), labels = paste(colorassignments[[best.assignment]]$ind1, colorassignments[[best.assignment]]$ind2, sep = "-"), las = 2)
abline(h = 40, lty = 2, lwd = 0.4)
Now we package everything in a function:
colorTSNE <- function(tsnecoords, palette, output.type = "tsneplot")
{
####This will be removed when the input is a tSNEplot with its own clustering!!!
tmp = dist(tsnecoords)
hc = hclust(tmp)
cutt = cutree(hc, k = length(palette))
dat = as.data.frame(cbind(tsnecoords,cutt)) #Add the clustering assignment to the tSNE coordinates
#####
centers = matrix(0, nrow = length(palette), ncol = 2)
for(i in unique(cutt))
{
centers[i,] = getCenter(tsne[cutt == i,])
}
#First step: Delauney triangulation and Voronoi tessellation
vtess <- deldir(centers[,1], centers[,2], rw = c(range(dat[,1]), range(dat[,2])))
dist.tmp = dist(centers)
tri <- vtess$delsgs
tri$dist = tri$ind1
for(i in 1:nrow(tri))
{
tri$dist[i] = as.matrix(dist.tmp)[tri$ind1[i], tri$ind2[i]]
}
arcs = as.matrix(tri[,5:7])
#Second step: Bellman-Ford shortest path calculation
BF_dists = list()
dists = vector()
for(k in 1:max(unique(cutt)))
{
BF_dists[[k]] = getShortestPathTree(1:max(unique(cutt)), arcs, source.node = k, algorithm = "Bellman-Ford", directed = F, show.data = F, show.graph = F, show.distances = F)
dists[k] = sum(BF_dists[[k]]$distances)
}
initial = which(dists == min(dists))
#Third step: cycle through all the colors and check the pairwise distance between Voronoi cells with shared edges choosing the best starting color
nextcols = as.numeric(names(as.matrix(dist.tmp)[initial,][order(as.matrix(dist.tmp)[initial,])][2:ncol(as.matrix(dist.tmp))]))
cps <- ppp(centers[,1], centers[,2],window=owin(range(dat[,1]), range(dat[,2])))
shared_edge_lengths <- sharededge(cps)
colorassignments = list()
listcolordf = list()
for(j in 1:length(palette))
{
colordf = data.frame(1:length(palette), rep("gray", length(palette)), stringsAsFactors = F)
colordf[initial, 2] = palette[j]
dist_to_initial = vector()
for(i in 1:length(palette))
{
dist_to_initial[i] = colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[initial, 2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(palette[i])@coords, from = "sRGB", to = "Lab"))
}
names(dist_to_initial) = palette
dist_to_initial = dist_to_initial[order(dist_to_initial, decreasing = T)]
dist_to_initial = dist_to_initial[1:(length(dist_to_initial)-1)]
colordf[nextcols,2] = names(dist_to_initial)
edgecolors = shared_edge_lengths
edgecolors$col1 = colordf[edgecolors$ind1,2]
edgecolors$col2 = colordf[edgecolors$ind2,2]
edgecolors$DeltaE.dist = apply(edgecolors, 1,
function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(x[4])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(x[5])@coords, from = "sRGB", to = "Lab")))
colorassignments[[j]] = edgecolors
listcolordf[[j]] = colordf
}
mindists = unlist(lapply(colorassignments, function (x) min(x$DeltaE.dist[x$DeltaE.dist > 0])))
best.assignment = which(mindists == max(mindists))
if(length(best.assignment) > 1) best.assignment = best.assignment[1]
best.colordf = listcolordf[[best.assignment]]
colnames(best.colordf) = c("cluster", "color")
if(output.type == "tsneplot")
{
plot(dat[,1], dat[,2], col = best.colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)
points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:length(palette), font = 2)
} else
if(output.type == "distanceplot")
{
plot(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 0, ylim = c(-1, max(colorassignments[[best.assignment]]$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")
segments(x0 = 1:nrow(colorassignments[[best.assignment]]), y0 = rep(0, nrow(colorassignments[[best.assignment]])), x1 = 1:nrow(colorassignments[[best.assignment]]), y1 =colorassignments[[best.assignment]]$DeltaE.dist)
points(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col1)
points(1:nrow(colorassignments[[best.assignment]]), colorassignments[[best.assignment]]$DeltaE.dist, cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col2)
axis(1, at = 1:nrow(colorassignments[[best.assignment]]), labels = paste(colorassignments[[best.assignment]]$ind1, colorassignments[[best.assignment]]$ind2, sep = "-"), las = 2)
abline(h = 40, lty = 2, lwd = 0.4)
} else
if(output.type == "assignment")
return(best.colordf)
}
A random color assignment would have generated some color pairs that are slightly harder to distinguish. However, this difficulty increases when we increase the number of colors generated by clustering. Let’s see it with 15 colors:
claram = cluster::clara(mm[,1:3], k = 15) #The syntax of clara is the same as k-means
claram = claram$medoids #Take medoids from the clara object
clcols = vector()
for(i in 1:15)
{
clcols[i] = mm[which(mm[,3] == claram[i,1] & mm[,1] == claram[i,2] & mm[,2] == claram[i,3]),4]
}
tmp = dist(tsne)
hc = hclust(tmp)
cutt = cutree(hc, k = 15)
dat = as.data.frame(cbind(tsne,cutt)) #Add the clustering assignment to the tSNE coordinates
plot(dat[,1], dat[,2], pch = 16, xlab = NA, ylab = NA, col = clcols[dat[,3]])
centers = matrix(0, nrow = 15, ncol = 2)
for(i in unique(cutt)) centers[i,] = getCenter(tsne[cutt == i,])
points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:length(clcols), font = 2)
We can see how many more instances of less distinguishable couples can occur:
colordf = data.frame(1:length(clcols), clcols, stringsAsFactors = F)
distpairs = expand.grid(1:length(clcols), 1:length(clcols))
distpairs$Col1 = colordf[distpairs$Var1,2]
distpairs$Col2 = colordf[distpairs$Var2,2]
distpairs$DeltaE.dist = apply(distpairs[,1:2], 1,
function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[x[1],2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(colordf[x[2],2])@coords, from = "sRGB", to = "Lab")))
plot(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 0, ylim = c(-1, max(distpairs$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")
segments(x0 = 1:nrow(distpairs), y0 = rep(0, nrow(distpairs)), x1 = 1:nrow(distpairs), y1 =distpairs$DeltaE.dist)
points(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 0.5, pch = 15, col = distpairs$Col1)
points(1:nrow(distpairs), distpairs$DeltaE.dist, cex = 0.5, pch = 15, col = distpairs$Col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
#axis(1, at = 1:nrow(distpairs), labels = paste(distpairs$Var1, distpairs$Var2, sep = "-"), las = 2)
abline(h = 40, lty = 2, lwd = 0.4)
Whereas using this procedure:
colorTSNE(tsne, clcols, output.type = "tsneplot")
The distances seem to be optimal:
colorTSNE(tsne, clcols, output.type = "distanceplot")
A quantitative palette such as those generated by the viridis package should not be used as a qualitative one, but it represents an interesting limit case due to the fact that colors are interpolated and will be similar by definition. We try using 10 colors:
colorTSNE(tsne, viridis(10, option = "B"), output.type = "tsneplot")
Distances are not optimal:
colorTSNE(tsne, viridis(10, option = "B"), output.type = "distanceplot")
As expected, quantitative palettes are hard to assign and should not be used for this task.
We can use more qualitative palettes e.g. from the great qualpalr package, which chooses colors from the DIN99 space (a perceptual colorspace in which the Euclidean distance is linear with the color distance) and applies power transformations (you can read more about it in the vignette):
qualpalette = qualpalr::qualpal(10, "pretty")
colors = qualpalette$hex
colorTSNE(tsne, colors, "tsneplot")
As expected, distances are ok. There is a pair below 40 but it is still pretty distinguishable.
colorTSNE(tsne, colors, output.type = "distanceplot")
Back to top
Back to the index page.